A self-organising neural network is presented that is based on a rigorous Bayesian 
analysis of the information contained in individual neural firing events. This 
leads to a visual cortex network (VICON) that has many of the properties that 
emerge when a mammalian visual cortex is exposed to data arriving from two 
imaging sensors (i.e. the two retinae), such as dominance stripes and orientation 
maps. 

1 Introduction 

The overall goal of this work is to automate as far as possible the processing 
of data from multiple sensors (data fusion), which includes the automatic design 
of the architecture and functionality of the network(s) that do the processing. 
A novel approach to this automation problem was introduced in earlier work, and 
the purpose of this paper is to refine and extend the previously reported results. 

The problem of automating the design of a data fusion network has many 
interesting special case solutions. In particular, the type of self-organising neural 
network (in the mammalian visual cortex) that processes the images arriving 
from a pair of retinae is one such special case, where the number of sensors 
involved is just two. For a review of visual cortex neural network models see 

mm- 

The basic idea is to use a soft encoder (i.e. its output is a distributed code 
in which more than one, and possibly all, of the output neurons is active) to 
transform the input vector (i.e. the input image) into a posterior probability 
over various possible classes (i.e. alternative possible interpretations of the input 
vector), and to optimise the encoder so that this posterior probability is able to 
retain as much information as possible about the input vector, as measured in 
the minimum mean square reconstruction error (i.e. L2 error) sense [10]. 



*This paper was submitted to Network on 6 November 1996. Paper reference 
NET/79294/PAP. It was not accepted for publication, but it underpins several subsequently 
published papers. 
In the special case where the optimisation is performed over the space of 
all possible soft encoders, the optimum solution is a hard encoder (i.e. it is 
a "winner-take-all" network in which only one of the output neurons is active) 
which is an optimal vector quantiser (VQ), of the type described in [3], for 
encoding the input vector with minimum L2 error. In the slightly less special 
case where the space of possible soft encoders is restricted to include only those 
whose output is deliberately damaged by the effects of a noise process, this 
produces a different type of hard encoder which is an optimal self-organising 
map (SOM) for encoding the input vector with minimum L2 error; this is very 
closely related to the well-known Kohonen map [3], as was demonstrated in [5]. 

This paper will examine yet another special case, where the optimisation is 
performed over a very special subspace of soft encoders, rather than over all 
possible soft encoders. The behaviour of each soft encoder is modelled by a set 
of posterior probabilities over various possible classes. When a particular para- 
metric form for these posterior probabilities is chosen, a corresponding subspace 
of possible soft encoders is thus automatically selected, which may be explored 
by varying the parameters. The parametric form of the posterior probability 
that is used in this paper is based on the so-called partitioned mixture 
distribution (PMD) [7] [8], which is a natural generalisation of the standard 
mixture distribution to a high-dimensional input space. 

This use of a PMD leads to a 2-layer visual cortex network (VICON), where 
the components of the input vector are the output activities of the input neurons, 
and the components of the PMD posterior probability are the output activities 
of the output neurons. Various physically realistic constraints are placed on the 
PMD optimisation (both on the internal PMD structure, and on the type of 
training data that is used), and these will be described in the text as they arise. 

The layout of this paper is as follows. In section 2 all of the necessary 
theoretical machinery is developed, including folded Markov chains, posterior 
probability models, derivatives of the objective function, and receptive fields. In 
section 3 the concepts of dominance stripes and orientation maps are explained, 
both in the context of the elastic net model, and in the context of the theory 
presented in this paper. In section 4 the results of computer simulations are 
presented, including both 1- and 2-dimensional retinae, single retinae and pairs 
of retinae, both for synthetic and natural training data. In appendix B some 
explicit optimal solutions that minimise the objective function are derived, 
including the periodicity property of some types of optimal solution. 

2 Theory 

This section covers all of the basic theoretical machinery that is required to 
design and train a 2-layer VICON. In section 2.1 the theory of folded Markov 
chains (FMC) is summarised. In section 2.2 the basic idea of a posterior 
probability model is introduced, and in section 2.3 this is developed into a full 
partitioned posterior probability model. In section 2.4 the derivatives of the FMC 
objective function are derived assuming a partitioned posterior probability model, 
and in section 2.5 the influence of finite-sized receptive fields on these 
derivatives is derived. 

Figure 1: (a) A Markov chain of transitions x → y → y' → x'. (b) The same 
diagram as (a), but folded. 

2.1 Folded Markov Chain 

The basis of the entire theoretical treatment is a communication channel model 
in which an input vector x is encoded to produce a conditional probability 
Pr(y|x) over code indices y, which is then transmitted along a distorting 
communication channel to produce a conditional probability Pr(y'|y) over 
distorted code indices y', which is finally decoded to produce a conditional PDF 
Pr(x'|y') over reconstructions x' of the original input vector x. The three steps 
in the sequence x → y → y' → x' are modelled by the conditional probabilities 
Pr(y|x), Pr(y'|y), and Pr(x'|y'), which describe a Markov chain of transitions, 
which is shown diagrammatically in figure 1(a). Pr(x'|y') is completely 
determined from other defined quantities by using Bayes' theorem in the form 

$$\Pr(\mathbf{x}|y) = \frac{\Pr(\mathbf{x})\,\Pr(y|\mathbf{x})}{\int d\mathbf{x}'\,\Pr(\mathbf{x}')\,\Pr(y|\mathbf{x}')}$$

Because x and x' live in the same vector space it is convenient to fold this 
diagram to produce figure 1(b); this is called a folded Markov chain (FMC) [5]. 
Figure 1(b) is directly related to a 2-layer unsupervised neural network, where 
x and x' represent the activity pattern of the whole set of neurons in the input 
layer, and y and y' represent the location(s) of winning neuron(s) in the output 
layer. The overall conditional PDF generated by an FMC is Pr(x'|x), which 
is obtained by marginalising y and y' in the joint probability Pr(x', y', y|x) = 
Pr(x'|y') Pr(y'|y) Pr(y|x). 

Define a network objective function D as [6] 

$$D \equiv \int d\mathbf{x}\, d\mathbf{x}'\, \Pr(\mathbf{x})\, \Pr(\mathbf{x}'|\mathbf{x})\, \|\mathbf{x}-\mathbf{x}'\|^2 \tag{1}$$

which measures the expected Euclidean (or L2) reconstruction error caused 
by feeding input vectors sampled from Pr(x) into the FMC, where each x is 
returned as a PDF Pr(x'|x) of alternative reconstructions x' of x. For simplicity, 
assume that the communication channel is distortionless, so that 
$\Pr(y'|y) = \delta_{y,y'}$, and that y = 1, 2, ..., M; then 

$$D = \sum_{y=1}^{M} \int d\mathbf{x}\, d\mathbf{x}'\, \Pr(\mathbf{x})\, \Pr(y|\mathbf{x})\, \Pr(\mathbf{x}'|y)\, \|\mathbf{x}-\mathbf{x}'\|^2 \tag{2}$$

An FMC is completely described by the form of its encoder Pr(y|x) and the 
form of its reconstruction error ||x − x'||². The functional form of the encoder 
may be chosen arbitrarily, and independently of the assumed Euclidean form 
of the reconstruction error, so the FMC does not correspond to a Gaussian 
mixture distribution model in input space. This is a general result for FMCs 
in which the functional forms of the encoder and the reconstruction error may 
be independently chosen. It is only when these functional forms are carefully 
chosen that a density model interpretation of an FMC is possible (for instance 
a Euclidean reconstruction error ||x − x'||² must be paired with an encoder 
Pr(y|x) that describes the posterior probability over class labels that would 
arise in a Gaussian mixture distribution model). 

The expression for D given in equation 2 may be simplified to yield [6] (this 
readily generalises to the case where $\Pr(y'|y) \neq \delta_{y,y'}$, i.e. the 
communication channel causes distortion) 

$$D = 2 \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y=1}^{M} \Pr(y|\mathbf{x})\, \|\mathbf{x}-\mathbf{x}'(y)\|^2 \tag{3}$$

where x'(y) is a reference vector defined as 

$$\mathbf{x}'(y) \equiv \int d\mathbf{x}\, \Pr(\mathbf{x}|y)\, \mathbf{x} \tag{4}$$

If this definition of x'(y) is not used, and instead D in equation 3 is minimised 
with respect to x'(y), then the stationary solution is x'(y) = ∫ dx Pr(x|y) x, 
which is consistent with the definition in equation 4. In practice, it is better 
to determine the stationary x'(y) by following the gradient 
$\partial D / \partial \mathbf{x}'(y)$ than to use the explicit expression 
∫ dx Pr(x|y) x for the stationary point, because $\partial D / \partial \mathbf{x}'(y)$ 
is cheap to evaluate whereas ∫ dx Pr(x|y) x is expensive to evaluate. In effect, 
the gradient approach is an example of on-line training, whereas the 
∫ dx Pr(x|y) x approach is the corresponding example of batch training, and the 
on-line and batch approaches each have their own areas where they are best used. 
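The distinction between the two approaches can be made concrete with a minimal sketch. It is not taken from the paper: it assumes scalar inputs, a hard (nearest reference vector) encoder, and an on-line step size of 1/n for the n-th sample won by a neuron; the names `nearest`, `batch_centroids` and `online_pass` are hypothetical.

```python
def nearest(x, refs):
    """Hard encoder (assumed): index of the reference vector closest to x."""
    return min(range(len(refs)), key=lambda y: (x - refs[y]) ** 2)


def batch_centroids(data, refs):
    """Batch update: x'(y) becomes the mean of all samples assigned to y."""
    sums = [0.0] * len(refs)
    counts = [0] * len(refs)
    for x in data:
        y = nearest(x, refs)
        sums[y] += x
        counts[y] += 1
    # Neurons that win no samples keep their old reference vector.
    return [s / c if c else r for s, c, r in zip(sums, counts, refs)]


def online_pass(data, refs):
    """On-line update: one step along (x - x'(y)) per sample, i.e. down the gradient."""
    refs = list(refs)
    counts = [0] * len(refs)
    for x in data:
        y = nearest(x, refs)
        counts[y] += 1
        refs[y] += (x - refs[y]) / counts[y]  # 1/n step size is an assumed schedule
    return refs


# Two well separated clusters of scalar inputs (illustrative data only).
data = [-0.4, 9.6, 0.1, 10.2, 0.3, 10.1, -0.2, 9.9, 0.2, 10.4]
start = [1.0, 9.0]
batch = batch_centroids(data, start)
online = online_pass(data, start)
```

With this particular step size, and with cluster assignments that do not change during the pass, the on-line result is a running mean and so agrees with the batch centroids; with a fixed small step size it would only approach them.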

In equation 3, Pr(y|x) is a "recognition model" (i.e. it takes an input vector 
and recognises it by assigning to it a posterior probability over class labels) and 
x'(y) is the corresponding "generative model" (i.e. it takes a class label and 
generates a corresponding vector in input space). This is a simpler type of 
generative model than appeared in the original expression for D in equation 2, 
where the generative model is Pr(x'|y), which generates a whole distribution 
of possible vectors in input space, rather than just a single vector which is the 
centroid of Pr(x'|y). The transformation of the FMC from one that uses the 
PDF Pr(x'|y) into one that uses the reference vector x'(y) is not possible in 
general; it was made possible here by choosing to use a Euclidean reconstruction 
error in D. In general, an FMC reconstruction is a distribution over alternative 
inputs, rather than a single representative input, as might be used in decision 
theory, for instance. 

Figure 2: A neural network representation of a folded Markov chain. 

The operation of the various terms in the expression for D in equation 3 is 
shown in figure 2, which is rotated through 90° anticlockwise with respect to 
the corresponding diagram in figure 1, and also for simplicity y' = y because 
$\Pr(y'|y) = \delta_{y,y'}$ was assumed above. When D is minimised with respect 
to the choice of encoder Pr(y|x) and reconstruction vector x'(y) it yields a 
standard minimum mean square error vector quantiser (VQ) with M code indices [4], 
and if $\Pr(y'|y) \neq \delta_{y,y'}$ then the VQ produces code indices that carry 
information in such a way that it is maximally robust with respect to the damaging 
effects of communication channel distortion modelled by Pr(y'|y) [5]. This latter 
type of VQ can be shown to be approximately equivalent to a self-organising map 
(SOM) of the type introduced by Kohonen [3]. 

2.2 Basic Posterior Probability (Single Recognition Model) 

The minimisation procedure that leads to a VQ-like optimum assumed that 
the entire space of posterior probability functions Pr(y|x) was available to be 
searched. In the neural network interpretation, Pr(y|x) models the probability 
that neuron y fires first (this encompasses both the case of a soft encoder, where 
more than one neuron can potentially fire first, and the winner-take-all case of a 
hard encoder, where only one neuron can potentially fire first), 
which depends on the detailed underlying dynamics of how all of the neurons 
interact with each other. Because these neural dynamics are not arbitrary (e.g. 
they are constrained to be a physically realisable process), the space of possible 
posterior probabilities Pr(y|x) that is available to the neural network is 
correspondingly constrained. Pr(y|x) may then be modelled by the functional form 

$$\Pr(y|\mathbf{x}) = \frac{Q(\mathbf{x}|y)}{\sum_{y'=1}^{M} Q(\mathbf{x}|y')} \tag{5}$$

where Q(x|y) is the raw "response function" of neuron y. 

Q(x|y) may be interpreted as the raw firing rate of neuron y, and Pr(y|x) 
is then the probability that neuron y fires first out of all of the M competing 
neurons. This functional form makes it clear that there is a type of lateral 
inhibition occurring between $\Pr(y_1|\mathbf{x})$ and $\Pr(y_2|\mathbf{x})$ 
(for $y_1 \neq y_2$), because if the raw firing rate $Q(\mathbf{x}|y_1)$ is 
increased so that $\Pr(y_1|\mathbf{x})$ increases, then the denominator 
$\sum_{y'=1}^{M} Q(\mathbf{x}|y')$ ensures that $\Pr(y_2|\mathbf{x})$ decreases; 
i.e. the Q(x|y) do not exhibit lateral inhibition, but the Pr(y|x) do exhibit 
lateral inhibition. 

The raw receptive field of a neuron depends on the form of Q(x|y). Thus 
if the functional form of Q(x|y) depends only on a subset x(y) of components 
of x, then x(y) is the raw receptive field of neuron y. However, this is not the 
same as the receptive field that is effective in producing the first firing event, 
because Pr(y|x) depends on all of the x(y') (for y' = 1, ..., M) as shown in 
equation 5. 

The effect of the distortion process, as modelled by Pr(y|y'), is to 
alter at the last minute, as it were, the probability that each neuron fires first. 
Thus the posterior probability is modified as follows 

$$\Pr(y|\mathbf{x}) \longrightarrow \sum_{y'=1}^{M} \Pr(y|y')\,\Pr(y'|\mathbf{x}) \tag{6}$$

where the matrix element Pr(y|y') leaks posterior probability from neuron y' 
onto neuron y. Such cross-talk amongst the neurons exists independently of the 
lateral inhibition effect produced by the denominator term 
$\sum_{y'=1}^{M} Q(\mathbf{x}|y')$ in equation 5. 

The VQ and SOM results (see [4][3]) may be obtained as special cases of raw 
neuron firing rates Q(x|y), where one neuron's firing rate is much larger than 
the other M − 1 neurons' firing rates (i.e. there is effectively only one neuron 
that can fire, so it is the winner-take-all). 
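Equations 5 and 6 can be illustrated with a small numerical sketch. The firing rates and the leakage matrix below are made-up values chosen only for illustration, and the function names `posterior` and `leak` are hypothetical.

```python
def posterior(q):
    """Equation 5: probability that each neuron fires first, given raw rates q."""
    total = sum(q)
    return [qy / total for qy in q]


def leak(post, leakage):
    """Equation 6: leakage[y][yp] = Pr(y|y') moves probability from y' onto y."""
    m = len(post)
    return [sum(leakage[y][yp] * post[yp] for yp in range(m)) for y in range(m)]


q = [0.2, 0.6, 0.2]            # illustrative raw firing rates Q(x|y)
before = posterior(q)

q_raised = [0.8, 0.6, 0.2]     # raise only the raw rate of neuron 0
after = posterior(q_raised)    # neurons 1 and 2 lose probability: lateral inhibition

# Each column sums to 1, so the leaked posterior stays normalised.
leakage = [[0.8, 0.1, 0.0],
           [0.2, 0.8, 0.2],
           [0.0, 0.1, 0.8]]
leaked = leak(after, leakage)
```

Raising one raw rate leaves the other raw rates untouched but lowers their posterior probabilities, which is the lateral inhibition effect described above; the leakage step then redistributes probability between neighbours without changing its total.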



2.3 Partitioned Posterior Probability (Multiple Recognition Models) 

The form of the posterior probability Pr(y|x) introduced in equation 5 is 
unsuitable for networks with a large number of neurons M, because the lateral 
inhibition is global rather than local. This can readily be inferred because the 
denominator term $\sum_{y'=1}^{M} Q(\mathbf{x}|y')$ in equation 5 computes a 
quantity that is the sum over all of the raw neuron firing rates. 

This problem can be remedied by defining a localised posterior probability 
Pr(y|x;y') as 

$$\Pr(y|\mathbf{x};y') \equiv \frac{Q(\mathbf{x}|y)\,\delta_{y \in \mathcal{N}(y')}}{\sum_{y'' \in \mathcal{N}(y')} Q(\mathbf{x}|y'')} \tag{7}$$

where $\mathcal{N}(y')$ is the local neighbourhood of neuron y', which is assumed 
to contain at least neuron y', and $\delta_{y \in \mathcal{N}(y')}$ is a Kronecker 
delta that constrains y to lie in the neighbourhood $\mathcal{N}(y')$. If 
$\mathcal{N}(y')$ contains all M neurons then Pr(y|x;y') reduces to Pr(y|x) as 
previously defined in equation 5. Pr(y|x;y') has the required normalisation 
property that $\sum_{y=1}^{M} \Pr(y|\mathbf{x};y') = 1$ for all y'. 
Because y' can take M possible values, there are M complete localised posterior 
probability functions Pr(y|x;y'). In effect, the neural network is split up into M 
overlapping subnetworks (these subnetworks overlap where 
$\mathcal{N}(y_1) \cap \mathcal{N}(y_2) \neq \emptyset$ for $y_1 \neq y_2$), each 
of which computes its own posterior probability function; 
note that any overlap between a pair of subnetworks causes the corresponding 
Pr(y|x;y') to be mutually dependent. 

It is not always convenient to use a neural network model in which there are 
M separate posterior probability models Pr(y|x;y'). However, these M localised 
posterior probability functions Pr(y|x;y') (for the M different choices of y') may 
be averaged together to produce a single posterior probability function. Thus 
define Pr(y|x) as 

$$\Pr(y|\mathbf{x}) \equiv \frac{1}{M} \sum_{y' \in \mathcal{N}^{-1}(y)} \Pr(y|\mathbf{x};y') \tag{8}$$

where $\mathcal{N}^{-1}(y)$ is the inverse neighbourhood of neuron y defined as 
$\mathcal{N}^{-1}(y) \equiv \{y' \mid y \in \mathcal{N}(y')\}$. This definition has 
all of the properties of a posterior probability function, including the 
normalisation property $\sum_{y=1}^{M} \Pr(y|\mathbf{x}) = 1$ (which may be derived 
by swapping the order of summations using the result 
$\sum_{y=1}^{M} \sum_{y' \in \mathcal{N}^{-1}(y)} (\cdots) = \sum_{y'=1}^{M} \sum_{y \in \mathcal{N}(y')} (\cdots)$). 
The form of the posterior probability given in equation 8, in which M individual 
posterior probabilities Pr(y|x;y') are averaged together, can be rigorously 
justified from a Bayesian point of view (see the appendix). 
The averaging process produces the posterior probability that should be 
used when there are M contributing models (as specified by the Pr(y|x;y') for 
y' = 1, 2, ..., M) that have equal prior weight. The average over the M models 
then simply marginalises over an unobserved degree of freedom (the model 
index y'). 
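The construction in equations 7 and 8 can be sketched in a few lines. The sketch assumes that the neurons sit on a ring, so that every neighbourhood has the same size; this topology and the names `neighbourhood` and `pmd_posterior` are assumptions made for illustration, not details taken from the paper.

```python
def neighbourhood(centre, half_width, m):
    """N(y'): neurons within half_width of centre on a ring of m neurons (assumed)."""
    return sorted({(centre + d) % m for d in range(-half_width, half_width + 1)})


def pmd_posterior(q, half_width):
    """Equation 8: average of the M localised posteriors of equation 7."""
    m = len(q)
    post = [0.0] * m
    for yp in range(m):                       # one localised model per y'
        nbrs = neighbourhood(yp, half_width, m)
        norm = sum(q[ypp] for ypp in nbrs)    # local lateral inhibition factor
        for y in nbrs:                        # y in N(y'), i.e. y' in N^-1(y)
            post[y] += q[y] / norm / m        # Pr(y|x;y') weighted by 1/M
    return post


q = [0.3, 0.9, 0.5, 0.1, 0.7, 0.2]            # illustrative raw firing rates
local = pmd_posterior(q, half_width=1)        # local lateral inhibition
full = pmd_posterior(q, half_width=3)         # neighbourhood covers all 6 neurons
global_post = [qy / sum(q) for qy in q]       # equation 5, for comparison
```

The loop visits each model y' and spreads its localised posterior over the neurons in its neighbourhood, which is the swapped order of summation mentioned above. When the neighbourhood covers the whole ring the result coincides with the global posterior of equation 5.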

If this localised definition of Pr(y|x) given in equation 8 is compared with 
the global definition given in equation 5 it is seen that the normalisation factor 
has been modified thus 

$$\frac{1}{\sum_{y'=1}^{M} Q(\mathbf{x}|y')} \longrightarrow \frac{1}{M} \sum_{y' \in \mathcal{N}^{-1}(y)} \frac{1}{\sum_{y'' \in \mathcal{N}(y')} Q(\mathbf{x}|y'')} \tag{9}$$

which alters its lateral inhibition properties. 
$1 / \sum_{y'' \in \mathcal{N}(y')} Q(\mathbf{x}|y'')$ is the lateral 
inhibition factor that derives from the neighbourhood of neuron y', which gives 
rise to a contribution to the lateral inhibition factor for all neurons y in the 
neighbourhood of y' via the average 
$\frac{1}{M} \sum_{y' \in \mathcal{N}^{-1}(y)} (\cdots)$. Thus the overall lateral 
inhibition factor acting on neuron y is derived locally from those neurons y'' that 
lie in the set $\mathcal{N}(\mathcal{N}^{-1}(y))$. The posterior probability model 
defined in equation 8 has been used before in the context of partitioned mixture 
distributions (PMDs), where multiple mixture distribution models are 
simultaneously optimised [7] [8]. 

Figure 3: A partitioned mixture distribution (PMD) neural network. 

Figure 3 shows the structure of the neural network corresponding to the PMD 
posterior probability in equation 8. Each output neuron has a raw receptive field 
of input neurons (which contains 5 input neurons in the example shown), and is 
also laterally inhibited by its neighbouring output neurons (the size of a neuron 
neighbourhood is 3 neurons to either side in the example shown). Note that 
the input-output links in figure 3 do not imply that the raw neuron firing rates 
Q(x|y) can be computed by using simple weighted connections; they are drawn 
merely to indicate the set of input neurons that influences the raw firing rate 
of each output neuron. Similarly, the output-output links in figure 3 are drawn 
to indicate the sizes of the output neuron neighbourhoods; the details of how 
lateral inhibition modifies the raw firing rates Q(x|y) of the output neurons to 
produce the probability Pr(y|x) that neuron y fires first are given in equation 8. 

For completeness, the PMD objective function in equation 3 may now be 
written out in full using the expression for the PMD posterior probability in 
equation 8 to yield (where the effects of leakage have been included, as defined 
in equation 6) 

$$D = \frac{2}{M} \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y=1}^{M} \sum_{y'=1}^{M} \Pr(y|y')\, Q(\mathbf{x}|y') \sum_{y'' \in \mathcal{N}^{-1}(y')} \frac{1}{\sum_{y''' \in \mathcal{N}(y'')} Q(\mathbf{x}|y''')}\, \|\mathbf{x}-\mathbf{x}'(y)\|^2 \tag{10}$$

This is the objective function that will be used to characterise the performance 
of the neural networks in all of the computer simulations. 
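2.4 Derivatives of the Objective Function 

In order to minimise the PMD objective function in equation 10 its derivatives 
must be calculated. First of all, define some convenient notation [10] 

$$\begin{aligned}
L_{y,y'} &\equiv \Pr(y'|y) \\
P_{y,y'} &\equiv \Pr(y'|\mathbf{x};y) = \frac{Q(\mathbf{x}|y')\,\delta_{y' \in \mathcal{N}(y)}}{\sum_{y'' \in \mathcal{N}(y)} Q(\mathbf{x}|y'')} \\
p_y &\equiv \sum_{y' \in \mathcal{N}^{-1}(y)} P_{y',y} \\
e_y &\equiv \|\mathbf{x}-\mathbf{x}'(y)\|^2 \\
(L^T p)_y &\equiv \sum_{y' \in \mathcal{L}^{-1}(y)} L_{y',y}\, p_{y'} \\
(L e)_y &\equiv \sum_{y' \in \mathcal{L}(y)} L_{y,y'}\, e_{y'} \\
(P L e)_y &\equiv \sum_{y' \in \mathcal{N}(y)} P_{y,y'}\, (L e)_{y'} \\
(P^T P L e)_y &\equiv \sum_{y' \in \mathcal{N}^{-1}(y)} P_{y',y}\, (P L e)_{y'}
\end{aligned} \tag{11}$$

where $\mathcal{L}(y)$ denotes the leakage neighbourhood of neuron y, which is the 
set of neurons that have posterior probability leaked onto them by neuron y, and 
the inverse leakage neighbourhood $\mathcal{L}^{-1}(y)$ is defined as 
$\mathcal{L}^{-1}(y) \equiv \{y' \mid y \in \mathcal{L}(y')\}$. 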

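Assume that the raw neuron firing rates may be modelled using a sigmoid function 

$$Q(\mathbf{x}|y) = \frac{1}{1+\exp\left(-\mathbf{w}(y)\cdot\mathbf{x} - b(y)\right)} \tag{12}$$

whence the derivatives may be obtained in the form [10] 

$$\begin{aligned}
\frac{\partial D}{\partial \mathbf{x}'(y)} &= -\frac{4}{M} \int d\mathbf{x}\, \Pr(\mathbf{x})\, (L^T p)_y\, (\mathbf{x}-\mathbf{x}'(y)) \\
\begin{pmatrix} \partial D / \partial b(y) \\ \partial D / \partial \mathbf{w}(y) \end{pmatrix} &= \frac{2}{M} \int d\mathbf{x}\, \Pr(\mathbf{x})\, \left(p_y (Le)_y - (P^T P L e)_y\right) \left(1 - Q(\mathbf{x}|y)\right) \begin{pmatrix} 1 \\ \mathbf{x} \end{pmatrix}
\end{aligned} \tag{13}$$

where the two derivatives $\partial D / \partial b(y)$ and 
$\partial D / \partial \mathbf{w}(y)$ have been written together for compactness. 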
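As a check on the form of the derivatives in equation 13, the following sketch compares them with central finite differences for a single input vector (i.e. Pr(x) concentrated on one sample). It is an illustration under stated assumptions rather than the training code used for the simulations: the ring-shaped neighbourhoods, the particular asymmetric leakage matrix, the random parameter values and all function names are assumptions.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def ring(centre, half_width, m):
    """Neighbourhood N(centre) on a ring of m neurons (assumed topology)."""
    return {(centre + d) % m for d in range(-half_width, half_width + 1)}


def forward(x, w, b, xref, leak, half_width):
    """Per-sample objective (2/M) sum_y (L^T p)_y e_y, plus intermediate terms."""
    m = len(b)
    q = [sigmoid(sum(wi * xi for wi, xi in zip(w[y], x)) + b[y]) for y in range(m)]
    P = [[0.0] * m for _ in range(m)]          # P[y][k] = Pr(k|x;y), zero outside N(y)
    for y in range(m):
        nbrs = ring(y, half_width, m)
        norm = sum(q[k] for k in nbrs)
        for k in nbrs:
            P[y][k] = q[k] / norm
    p = [sum(P[k][y] for k in range(m)) for y in range(m)]
    e = [sum((xi - ri) ** 2 for xi, ri in zip(x, xref[y])) for y in range(m)]
    ltp = [sum(leak[k][y] * p[k] for k in range(m)) for y in range(m)]
    d = 2.0 / m * sum(ltp[y] * e[y] for y in range(m))
    return d, q, P, p, e, ltp


def gradients(x, w, b, xref, leak, half_width):
    """Analytic per-sample gradients in the form of equation 13."""
    m = len(b)
    _, q, P, p, e, ltp = forward(x, w, b, xref, leak, half_width)
    le = [sum(leak[y][k] * e[k] for k in range(m)) for y in range(m)]
    ple = [sum(P[y][k] * le[k] for k in range(m)) for y in range(m)]
    ptple = [sum(P[k][y] * ple[k] for k in range(m)) for y in range(m)]
    g_xref = [[-4.0 / m * ltp[y] * (xi - ri) for xi, ri in zip(x, xref[y])]
              for y in range(m)]
    g_b = [2.0 / m * (p[y] * le[y] - ptple[y]) * (1.0 - q[y]) for y in range(m)]
    g_w = [[g * xi for xi in x] for g in g_b]
    return g_xref, g_b, g_w


def numeric(f, values, i, eps=1e-6):
    """Central finite difference of f() with respect to values[i]."""
    old = values[i]
    values[i] = old + eps
    up = f()
    values[i] = old - eps
    down = f()
    values[i] = old
    return (up - down) / (2.0 * eps)


rng = random.Random(0)
m, n, half_width = 5, 3, 1
x = [rng.uniform(-1, 1) for _ in range(n)]
w = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
b = [rng.uniform(-1, 1) for _ in range(m)]
xref = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
# leak[y][k] = Pr(k|y): rows sum to 1; the asymmetric values are an arbitrary choice.
leak = [[0.6 if k == y else 0.3 if k == (y + 1) % m else 0.1 if k == (y - 1) % m
         else 0.0 for k in range(m)] for y in range(m)]


def objective():
    return forward(x, w, b, xref, leak, half_width)[0]


g_xref, g_b, g_w = gradients(x, w, b, xref, leak, half_width)
err_b = abs(g_b[2] - numeric(objective, b, 2))
err_w = abs(g_w[1][0] - numeric(objective, w[1], 0))
err_x = abs(g_xref[3][1] - numeric(objective, xref[3], 1))
```

Dense matrices with zeros outside the neighbourhoods are used so that the sums over neighbourhoods in equation 11 become plain sums over all neurons; this keeps the sketch short at the cost of efficiency.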



2.5 Receptive Fields 

The raw firing rate Q(x|y) of neuron y depends only on a subset x(y) of 
components of x; x(y) is thus the raw receptive field of neuron y. However, the 
posterior probability Pr(y|x) that neuron y fires first is derived from Q(x|y) by 
weighting it with a lateral inhibition factor that depends on the raw firing rates 
of all neurons in $\mathcal{N}(\mathcal{N}^{-1}(y))$, as seen in equation 8, so the 
overall receptive field of a neuron is rather broader than its raw receptive field. 
The effect of leakage, as defined in equation 6, is to broaden the overall 
receptive field further still. The optimal reference vector x'(y) has non-trivial 
structure only within this overall receptive field, so inside the overall receptive 
field the components of x'(y) must be subjected to an optimisation procedure to 
discover their optimal form, whereas outside the overall receptive field the 
components of x'(y) may be set to be the average values of the corresponding 
components of the training vectors x (see the definition of x'(y) in equation 4, 
which reduces to x'(y) = ∫ dx Pr(x) x for those components of x'(y) that lie 
outside the overall receptive field of neuron y). 

In the simulations that will be presented here a suboptimal approach is used, 
where only those components of x' (y) that lie inside the raw receptive field are 
optimised; this produces a least upper bound on the value of the objective 
function that would have been obtained if a full optimisation had been used. 
Also, it is assumed that the input data has been prepared in such a way that 
each component is zero mean. This is not actually a restriction, because the 
objective function is invariant with respect to adding a different constant to 
each component of x, because it is a function of the difference x − x'. In this 
suboptimal approach, and with the zero mean assumption, the components of 
x' (y) that lie outside the raw receptive field of neuron y will be set to zero. 

The fact that the components of x'(y) that lie outside the raw receptive 
field of neuron y are zero may be used to simplify the evaluation of the various 
terms in equation 13. Thus evaluate $p_y (Le)_y - (P^T P L e)_y$ by 
expanding $e_y$ as 

$$\begin{aligned}
e_y &= \|\mathbf{x}\|^2 - 2\,\mathbf{x}\cdot\mathbf{x}'(y) + \|\mathbf{x}'(y)\|^2 \\
&= \|\mathbf{x}\|^2 + \mathbf{x}'(y)\cdot(\mathbf{x}'(y) - 2\mathbf{x})
\end{aligned} \tag{14}$$

which is a sum of a constant (i.e. does not depend on y) term ||x||² and a term 
x'(y) · (x'(y) − 2x) that does depend on y. What happens to the constant term 
when it is substituted into $p_y (Le)_y - (P^T P L e)_y$? 

$$\begin{aligned}
p_y (Le)_y - (P^T P L e)_y &\longrightarrow p_y (L \cdot 1)_y - (P^T P L \cdot 1)_y \\
&= p_y - p_y \\
&= 0
\end{aligned} \tag{15}$$

It cancels out, so $e_y$ might as well be replaced as follows in 
$p_y (Le)_y - (P^T P L e)_y$ 

$$e_y \longrightarrow \mathbf{x}'(y)\cdot(\mathbf{x}'(y) - 2\mathbf{x}) \tag{16}$$

Because the components of x'(y) that lie outside the raw receptive field of 
neuron y are set to zero, the x'(y) · (···) operation effectively projects out any 
components of (···) that happen to lie outside this raw receptive field. This 
means that the only components of x in equation 16 that survive are those that 
lie inside the raw receptive field, so effectively $e_y$ depends only on quantities 
that lie inside the raw receptive field of neuron y. Note that a full optimisation 
of x'(y), in which all components that lie inside the overall receptive field of 
neuron y are optimised, would produce a different result. 
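The cancellation of the constant term in equation 15 rests on three normalisation identities that follow from the definitions in equation 11. They are spelled out here as a short worked step, where $\mathbf{1}$ denotes the vector whose components are all equal to 1 (a notation assumed for this sketch):

```latex
\begin{align*}
(L\mathbf{1})_y &= \sum_{y' \in \mathcal{L}(y)} \Pr(y'|y) = 1, \\
(P\mathbf{1})_y &= \sum_{y' \in \mathcal{N}(y)} \Pr(y'|\mathbf{x};y) = 1, \\
(P^{T}\mathbf{1})_y &= \sum_{y' \in \mathcal{N}^{-1}(y)} P_{y',y} = p_y, \\
\text{so that}\quad (P^{T} P L \mathbf{1})_y &= (P^{T} P \mathbf{1})_y = (P^{T}\mathbf{1})_y = p_y
\quad\text{and}\quad p_y (L\mathbf{1})_y = p_y .
\end{align*}
```

The first identity assumes that all of the probability leaked by neuron y lands inside its leakage neighbourhood, and the second is the normalisation of the localised posterior probability over the neighbourhood of neuron y.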

3 Dominance Stripes and Orientation Maps 

The purpose of this section is to discuss the two phenomena of dominance 
stripes and orientation maps. In section 3.1 a brief review of the popular elastic 
net model of dominance stripes is presented, and in section 3.2 an informal 
derivation of the origin of both dominance stripes and orientation maps is given. 

Figure 4: An elastic net oscillating back and forth in ocularity between a pair 
of retinae. 

3.1 Review of Dominance Stripes Using the Elastic Net Model 

The results that will be presented here are, broadly speaking, equivalent to the 
way in which ocular dominance stripes are obtained in the elastic net model (as 
reviewed in [2][14]) as applied to a pair of retinae. The essential features of this 
type of model of ocular dominance are shown in figure 4 (which is copied from 
[2]). The left and right retinae are represented as 1-dimensional lines of units 
at the top and bottom of the diagram. The horizontal dimension represents 
distance across a retina, and the vertical dimension represents the ocularity 
degree of freedom. The distance between any two retinal units, either within or 
between retinae, represents the correlation between those two units [2]. Thus 
the ratio of the separation between the retinae to the spacing of units within a 
retina determines the relative strength of the inter-retinal and intra-retinal 
correlations. The elastic net is represented by the line oscillating back and forth 
between the retinae. The net effect of the elastic net algorithm is to encourage 
the elastic net to pass as close as possible (in a well-defined sense) to all of 
the retinal units, and also to minimise its total length. These are conflicting 
requirements, and the oscillatory solution shown in figure 4 is typical of an 
optimal elastic net configuration, which thus predicts an oscillatory pattern of 
ocular dominance (which corresponds to dominance stripes in the case of 
2-dimensional retinae). 

This type of model inevitably leads to dominance stripe formation, because 
the elastic net model separates the input components into two clusters (see figure 
4) according to whether they belong to the left or right retina. In effect the 
output layer of the network is explicitly told which retina an input component 
belongs to, and this fact is expressed by the position of the component along the 
ocularity dimension. The goal in this paper is to construct a more natural model 
of dominance stripe formation, in which the ocularity dimension is revealed by a 
process of self-organisation, rather than being hard-wired into the model. Thus, 
the visual cortex model that is presented in this paper will not explicitly label 
the input pixels as belonging to the right or left retina (as they are in figure 4), 
but will have to deduce their left/right retina membership from the properties 
of the training set instead. 

Figure 5: Neural network model with a limited receptive field. 

3.2 Informal Derivation of Dominance Stripes and Orientation Maps 

The purpose of this section is to present a simple picture that makes it clear 
what types of behaviour should be expected from a neural network that minimises 
the objective function in equation 10. 

3.2.1 Neural Network Model 

It is assumed that each of the output neurons has only a limited receptive field 
of input neurons within each of the two retinae. In effect, this is a hand-crafted 
version of a "wire length" constraint, which ensures that the total length of the 
input-to-output connections is limited. In the context of the elastic net model 
this corresponds to the limited range of interaction between retinal units (the 
input) and elastic net units (the output). Also, it is assumed that sigmoidal 
neurons with local probability leakage are used, which generates an effect that 
is analogous to the elastic tension in the elastic net model, because it encourages 
neighbouring neurons to adopt similar parameter values. 

This model is drawn in figure 5 in an analogous way to the elastic net model 
in figure 4. In this model the ocularity dimension is not explicitly present, and 
the elasticity (of the elastic net) is replaced by the probability leakage mechanism 
that enables neighbouring output neurons to communicate with each other. The 
separation of input neurons into left and right retinae in figure 5 is made only 
for comparison between figure 5 and the elastic net model in figure 4. When the 
left and right receptive fields are presented to the output neuron, all information 
about which retina the various input neurons belong to has been discarded; all 
input neurons within the left and right receptive fields are treated on an equal 
basis. The ocularity dimension will emerge by a process of self-organisation 
driven by the statistical properties of the images received by the left and right 
retinae. 

3.2.2 Very Low Resolution Input Images 

The simplest situation is when there are two retinae (as in the above elastic net 
model), each of which senses independently a featureless scene, i.e. all the units 
in a retina sense the same brightness value, but the two brightnesses that the 
left and right retinae sense are independent of each other. This situation would 
arise if the images projected onto the two retinae were very low resolution, so all 
spatial detail is lost. This limits the input data to lying in a 2-dimensional space 
R². If these two featureless input images (i.e. left and right retinae) are then 
normalised so that the sum of left and right retina brightness is constrained to 
be constant, then the input data is projected down onto a 1-dimensional space 
R¹, which effectively becomes the ocularity dimension. If each of the M output 
neurons had an infinite-sized receptive field, then the optimal network would be 
the one in which the M neurons cooperate to give the best soft encoding of R¹. 
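The projection onto the ocularity dimension can be illustrated with a few lines of code. The uniform brightness distribution, the normalisation constant of 1 and the function name `normalise` are assumptions made for this sketch.

```python
import random


def normalise(left, right):
    """Scale a (left, right) brightness pair so that the two values sum to 1."""
    total = left + right
    return left / total, right / total


rng = random.Random(1)
# Independent brightnesses for the two featureless retinae
# (uniform on (0.1, 1.0) is an assumed distribution).
pairs = [(rng.uniform(0.1, 1.0), rng.uniform(0.1, 1.0)) for _ in range(5)]
points = [normalise(l, r) for l, r in pairs]

# After normalisation a single number locates each point on the line from
# (0, 1) to (1, 0); this coordinate plays the role of ocularity.
ocularity = [l for l, _ in points]
```

Each normalised pair lies on the line segment joining (0, 1) and (1, 0), so one coordinate is enough to describe it.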

However, because of the limited receptive field size and output neuron neigh- 
bourhood size, the neurons can at best co-operate together a few at a time (this 
also depends on the size of the leakage neighbourhood). If the network proper- 
ties are translation invariant this leads to an optimal network whose properties 
fluctuate periodically across the network (see appendix [B]), where each period 
typically contains a complete repertoire of the computing machinery that is 
needed to process the contents of a receptive field; this effect is called com- 
pleteness, and it is a characteristic emergent property of this type of neural 
network. 

The only unexplained step in this argument is the use of a normalisation 
procedure on the input. However, if the input to this network is the PMD 
posterior probability computed by the output layer of another such network, 
then there is already such a normalisation effect induced by the lateral inhibition 
within the PMD posterior probability. For featureless input images, this lateral 
inhibition effect causes precisely the type of normalisation that is used above 
(i.e. left plus right retina brightness is constant) to occur naturally. 

These results are summarised in figure 6, where the ocularity dimension runs 
from (0,1) to (1,0), and a typical set of neural reference vectors is shown. 
The oscillation of these reference vectors back and forth along the ocularity 
dimension corresponds to the oscillations of the elastic net that are represented 
in figure 4. 

Figure 6: Typical neural reference vectors for very low resolution input images. 

3.2.3 Low Resolution Input Images 

A natural generalisation of the above is to the case of not-quite-featureless input
images. This could be brought about by gradually increasing the resolution of
the input images until it is sufficient to reveal spatial detail on a size scale equal
to the receptive field size. Instead of seeing a featureless input, each neuron
would then see a brightness gradient within its receptive field. This could be
interpreted by considering the low order terms of a Taylor expansion of the input
image about a point at the centre of the neuron's receptive field: the zeroth order
term is local average brightness (which lives on a 1-dimensional line $R^1$), and the two
first order terms are the local brightness gradient (which lives in a 2-dimensional
space $R^2$). When normalisation is applied this reduces the space in which the
two images live to $R^1 \times R^2 \times R^2$ ($R^1$ from the zeroth order Taylor term with
normalisation taken into account, $R^2$ from the first order Taylor terms, counted
twice to deal with each retina).
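
As a rough illustration of the Taylor expansion picture above, the sketch below fits a plane to the brightness values inside one receptive field by least squares, which recovers an estimate of the zeroth order term (local average brightness) and the two first order terms (local brightness gradient). The helper name `local_taylor_terms` and the use of NumPy are assumptions made for illustration only; the network itself is not claimed to compute these terms explicitly.

```python
import numpy as np

def local_taylor_terms(patch):
    """Least-squares fit of patch[r, c] ~ a + gr*(r - r0) + gc*(c - c0).

    (r0, c0) is the centre of the patch, so `a` estimates the local average
    brightness (zeroth order term) and (gr, gc) the local brightness
    gradient (first order terms). Hypothetical helper, not from the paper.
    """
    patch = np.asarray(patch, dtype=float)
    rows, cols = patch.shape
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    r0, c0 = (rows - 1) / 2.0, (cols - 1) / 2.0
    # One column per Taylor term: constant, row offset, column offset.
    design = np.column_stack([
        np.ones(patch.size),
        (r - r0).ravel(),
        (c - c0).ravel(),
    ])
    coeffs, *_ = np.linalg.lstsq(design, patch.ravel(), rcond=None)
    a, gr, gc = coeffs
    return a, (gr, gc)
```

For a featureless patch the fitted gradient is zero and only the $R^1$ term survives, which corresponds to the very low resolution case of the previous section.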

The $R^1$ from the zeroth order Taylor term gives rise to ocular dominance
stripes, which thus cause the left and right retinae to map to different stripe-shaped
regions of the output layer. The remaining $R^2 \times R^2$ then naturally splits
into two contributions (left retina and right retina), each of which maps to the
appropriate stripe. If the stripes did not separate the left and right retinae,
then the $R^2 \times R^2$ could not be split apart in this simple manner. Finally, since
each ocular dominance stripe occupies a 2-dimensional region of the output
layer, a direct mapping of the corresponding $R^2$ (which carries local brightness
gradient information) to output space can be made. As in the case of dominance
stripes alone, the limited receptive field size and output neuron neighbourhood
size cause the neurons to co-operate together only a few at a time, so that
each local patch of neurons contains a complete mapping from $R^2$ to the
2-dimensional output layer.

Figure 7: Typical neural reference vectors for low resolution input images.

These results are summarised in figure [7] where the pure oscillation back and 
forth along the ocularity dimension that occurred in figure [6] develops to reveal
some additional degrees of freedom, only one of which is represented in figure [7] 
(it is perpendicular to the ocularity axis). 

If the leakage is reduced then the oscillation back and forth along the dom- 
inance axis tends to be more like a square wave than a sine wave, in which 
case figure [7] becomes as shown in figure [8] where the neural reference vectors 
are bunched near to the points (0, 1) and (1,0), and explore the additional de- 
gree(s) of freedom at each end of the ocularity axis. In the extreme case, where 
the ocularity switches back and forth as a square wave, the neurons separate 
into two clusters, one of which responds only to the left retina's image and the 
other to the right retina's image. Furthermore, within each of these clusters, the 
neurons explore the additional degree(s) of freedom that occur within the
corresponding retina's image. Note that only one such degree of freedom is represented
in figure [8]; it is perpendicular to the ocularity axis.

The above arguments can be generalised to the case of input images with 
fine spatial structure (i.e. lots of high order terms in the Taylor expansion are 
required). However, more and more neurons (per receptive field) are required 
in order to build a faithful mapping from input space to a 2-dimensional repre- 
sentation in output space. For a given number of neurons (per receptive field) a 
saturation point will quickly be reached, where the least important detail (from 
the point of view of the objective function) is discarded, keeping only those prop- 
erties of the input images that best preserve the ability of the neural network 
to reconstruct its own input with minimum Euclidean error (on average).

Figure 8: Typical neural reference vectors for low resolution input images, where
reduced leakage causes the ocularity to switch abruptly back and forth.

4 Simulations 

Two types of training data will be used: synthetic, and natural. Synthetic 
data is used in order to demonstrate simple properties of the neural network, 
without introducing extraneous detail to complicate the interpretation of the 
results. Natural data is used to remove any doubt that the neural network is 
capable of producing interesting and useful results when it encounters data that 
is more representative of what it might encounter in the real world. 

In section [4.1] dominance stripes are produced from a 1-dimensional retina,
and in section [4.2] these results are generalised to a 2-dimensional retina. In
both cases both synthetic and natural image results are shown. In section [4.3]
orientation maps are produced for the case of two retinae trained with natural
images.

4.1 Dominance Stripes: The 1-Dimensional Case 

The purpose of the simulations that are presented in this section is to demon- 
strate the emergence of ocular dominance stripes in the simplest possible realistic 
case. The results will correspond to the situation outlined in figure [HI 

4.1.1 Synthetic Training Data 

The purpose of this simulation is to demonstrate the emergence of ocular
dominance stripes, of the type that were shown in figure [6], by presenting a model
of the type shown in figure [5] with very low-resolution input images. In fact,
the resolution is so low that each image is entirely featureless, so that all the
neurons in a retina have the same input brightness, but the two retinae have
independent input brightnesses. These input images are normalised by processing
them so that they look like the PMD posterior probability computed by
the output layer of another such network; the neighbourhood size used for this
normalisation process was chosen to be the same as the network's own output
layer neighbourhood size.

In the first simulation the parameters used were: network size = 30, receptive
field size = 9, output layer neighbourhood size = 5 (centred on the source
neuron), leakage neighbourhood size = 5 (centred on the source neuron), number
of training updates = 2000, update step size = 0.01. For each neuron the leakage
probability had a Gaussian profile centred on the neuron, and the standard
deviation was chosen as 1, to make the profile fall from 1 on the source neuron
to exp(-1/2) on each of its two closest neighbours.
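
The leakage profile described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function name `gaussian_leakage_profile` is hypothetical, and the final rescaling to unit sum is an assumption (the text only specifies the unnormalised shape of the profile).

```python
import numpy as np

def gaussian_leakage_profile(neighbourhood_size=5, sigma=1.0):
    """Unnormalised Gaussian profile over a 1-D leakage neighbourhood.

    The profile equals 1 on the source neuron (offset 0) and
    exp(-1/2) on its two nearest neighbours when sigma = 1.
    """
    if neighbourhood_size % 2 != 1:
        raise ValueError("neighbourhood_size must be odd to be centred")
    half = neighbourhood_size // 2
    offsets = np.arange(-half, half + 1)
    return np.exp(-offsets**2 / (2.0 * sigma**2))

profile = gaussian_leakage_profile(5, 1.0)
# Assumption: the profile is rescaled to sum to 1 before being used
# as a leakage probability.
leakage = profile / profile.sum()
```

Reducing `sigma` from 1 to 0.5 concentrates the profile on the source neuron, which is the reduced leakage setting used later in this section.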

The update scheme used was a crude gradient following algorithm parame- 
terised by three numbers which controlled the rate at which the weight vectors, 
biasses and reference vectors were updated. These three numbers were contin- 
uously adjusted to ensure that the maximum rate of change (as measured over 
all the neurons in the network) of the length of each weight vector, and also the 
maximum rate of change of the absolute value of each bias, was always equal 
to the requested update step size; this prescription will adjust the parameter 
values until they jitter around in the neighbourhood of their optimum values. 
The optimum reference vectors could in principle be completely determined
using equation [3] for each choice of weights and biasses, but it is not necessary for
the reference vectors to keep in precise synchrony with the weights and biasses.
Rather, the reference vectors were controlled in a similar way to the weight vec- 
tors, except that they used three times the update step size, which made them 
more agile than the weights and biasses they were trying to follow. 
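
One way to picture this update scheme is sketched below. It is not the author's code: the function name `rescaled_step` is hypothetical, and the sketch simplifies "rate of change of the length of each weight vector" to the length of each neuron's update, which is an assumption. The single rate per parameter group is recomputed at every update so that the largest per-neuron move equals the requested step size.

```python
import numpy as np

def rescaled_step(params, grads, step_size):
    """Move each row of `params` against its gradient, using one shared rate
    chosen so that the largest per-neuron move has length `step_size`.

    params, grads: arrays whose first axis indexes neurons.
    """
    params = np.asarray(params, dtype=float)
    grads = np.asarray(grads, dtype=float)
    # Length of the gradient for each neuron (absolute value for scalars).
    move_lengths = np.linalg.norm(grads.reshape(len(grads), -1), axis=1)
    largest = move_lengths.max()
    if largest == 0.0:
        return params.copy()  # stationary point: nothing to move
    rate = step_size / largest
    return params - rate * grads
```

Under these assumptions the weights and biasses would be updated with `step_size = 0.01`, and the reference vectors with three times that value, as described above.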

The ocular dominance stripes that emerge from this simulation are shown
in figure [9]. The ocularity for a given neuron was estimated by computing
the average of the absolute deviations (as measured with respect to the overall 
mean reference vector component value, which is zero for the zero mean training 
data that is used here) of its reference vector components within its receptive 
field, both for the left retina and the right retina. This allows two plots to 
be drawn: average value of absolute deviations from the mean in left retina's 
receptive field as a function of position across the network, and similarly the 
right retina's receptive field. As can be seen in figure [9], these two curves are
approximately periodic, and are in antiphase with each other; this corresponds 
to the situation shown in figure [SI The amplitude of the ocularity curves is 
less than the 0.5 that would be required for the end points of the ocularity 
dimension to be reached, because one of the effects of leakage is to introduce a 
type of elastic tension between the reference vectors that causes them to contract 
towards zero ocularity. Note how the ocular dominance curves have a period of
approximately 7, which is slightly greater than the output layer neighbourhood
size (which is 5). In the limit of zero leakage and infinite receptive field size
the period would be equal to the output layer neighbourhood size, in order to
guarantee that a complete set of processing machinery is contained within each
output layer neighbourhood size; this effect is called completeness.

Figure 9: 1-dimensional dominance stripes after training on synthetic data.

Figure 10: 1-dimensional square wave dominance stripes after further training
with reduced probability leakage on synthetic data.
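
The ocularity estimate used for these plots can be sketched as follows. The array layout (one row per neuron, one column per reference vector component inside its receptive field) and the function name `ocularity_curves` are assumptions made for illustration.

```python
import numpy as np

def ocularity_curves(ref_left, ref_right, overall_mean=0.0):
    """Average absolute deviation of each neuron's reference vector
    components from the overall mean, computed separately per retina.

    ref_left, ref_right: arrays of shape (num_neurons, receptive_field_size)
    (assumed layout). Returns one curve per retina, indexed by neuron
    position across the network.
    """
    ref_left = np.asarray(ref_left, dtype=float)
    ref_right = np.asarray(ref_right, dtype=float)
    left = np.mean(np.abs(ref_left - overall_mean), axis=1)
    right = np.mean(np.abs(ref_right - overall_mean), axis=1)
    return left, right
```

Plotting the two returned curves against neuron position gives the pair of antiphase ocular dominance curves discussed above.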

If the above simulation is continued for a further 2000 updates with a reduced 
leakage, by reducing the standard deviation of the Gaussian leakage profile from 
1 to 0.5, then the ocular dominance curves become more like square waves than 
sine waves, as shown in figure [10]; this is similar to the type of situation that
was shown in figure [51 except that the input images are featureless in this case. 



4.1.2 Natural Training Data 

Figure [11] shows the Brodatz texture image [1] that was used to generate a
more realistic training set than was used in the synthetic simulations described
above. Figure [12] shows an enlarged portion of figure [11], where it is clear that the
characteristic length scale of the texture structure is in the range 5-10 pixels.
This is large enough compared to the receptive field size (9) and the output layer
neighbourhood size (5) that a simulation using 1-dimensional training vectors
extracted from this 2-dimensional Brodatz image will effectively see very low
resolution training data, and should respond approximately as described in figure
[6].

Figure 13: 1-dimensional dominance stripes after training on natural data.

Figure 14: 1-dimensional square wave dominance stripes after further training
with reduced probability leakage on natural data.

The results corresponding to figure [9] and figure [10] are shown in figure [13]
and figure [14], respectively. The general behaviour is much the same in the
synthetic and Brodatz cases, except that the depth of the ocularity fluctuations 
is somewhat less in the real case, because in the Brodatz case the training data 
is not actually featureless within each receptive field. 

4.2 Dominance Stripes: The 2-Dimensional Case 

This section extends the results of the previous section to the case of 2-dimensional 
neural networks. The training schedule(s) used in the simulations have not been 
optimised. Usually the update rate is chosen conservatively (i.e. smaller than it 
needs to be) to avoid possible numerical instabilities, and the number of training 
updates is chosen to be larger than it needs to be to ensure that convergence has 
occurred. It is highly likely that much more efficient training schedules could
be found.

Figure 15: 2-dimensional dominance stripes after training on synthetic data.

4.2.1 Synthetic Training Data 

The results that were presented in figure [9] may readily be extended to the case of
a 2-dimensional network. The parameters used were: network size = 100 x 100,
receptive field size = 3 x 3 (which is artificially small to allow the simulation
to run faster), output layer neighbourhood size = 5 x 5 (centred on the source
neuron), leakage neighbourhood size = 3 x 3 (centred on the source neuron),
number of training updates = 24000 (dominance stripes develop quickly, so far
fewer than 24000 training updates could be used), update step size = 0.001.
For each neuron the leakage probability had a Gaussian profile centred on the
neuron, and the standard deviations were chosen as 1 x 1, to make the profile fall
from 1 on the source neuron to exp(-1/2) on each of its four closest neighbours.

Apart from the different parameter values, the simulation was conducted in 
precisely the same way as in the 1-dimensional case, and the results for ocular
dominance are shown in figure [15], where ocularity has been quantised as
a binary-valued quantity. These results show the characteristic striped struc- 
ture that is familiar from experiments on the mammalian visual cortex. The 
behaviour near to the boundary depends critically on the interplay between the 
receptive field size(s) and the output layer neighbourhood size(s). 

4.2.2 Natural Training Data 

The simulation, whose results were shown in figure [15], may be repeated using
the Brodatz image training set shown in figure [11] to yield the results shown in
figure [16]. These results are not quite as stripe-like as the results in figure [15],
because in the Brodatz case the training data is not actually featureless within
each receptive field.

Figure 16: 2-dimensional dominance stripes after training on natural data.

4.3 Orientation Maps 

The purpose of the simulations that are presented in this section is to demon- 
strate the emergence of orientation maps in the simplest possible realistic case. 
In the case of two retinae, the results will correspond to the situation outlined 
in figure [7] (or, at least, a higher dimensional version of that figure). 

4.3.1 Orientation Map (One Retina) 

In this simulation the parameters used were: network size = 30 x 30, receptive
field size = 17 x 17, output layer neighbourhood size = 9 x 9 (centred on the
source neuron), leakage neighbourhood size = 3 x 3 (centred on the source
neuron), number of training updates = 24000, update step size = 0.01. For each
neuron the leakage probability had a Gaussian profile centred on the neuron,
and the standard deviations were chosen as 1 x 1, to make the profile fall from
1 on the source neuron to exp(-1/2) on each of its four closest neighbours.

Note that both the receptive field size and the output layer neighbourhood 
size are substantially larger than in the 2-dimensional dominance stripe simu- 
lations, because many more neurons are required in order to allow orientation 
maps to develop than to allow dominance stripes to develop; in fact it would be 
preferable to use even larger sizes than were used here. To limit the computer 
run time this meant that the overall size of the neural network had to be reduced 
from 100 x 100 to 30 x 30. The training set was the Brodatz texture image in
figure [11].

Figure 17: Orientation map after training on natural data.




Figure 18: Typical input, output and reconstruction produced by the orientation 
map. 

The results are shown in figure [17] where the receptive fields have been gath- 
ered together in a montage. There is a clear swirl-like pattern that is character- 
istic of orientation maps. Each local clockwise or anticlockwise swirl typically 
circulates around an unoriented region. 

4.3.2 Using the Orientation Map 

In figure [18] the orientation map network shown in figure [17] is used to encode
and decode a typical input image. On the left of figure [18] the input image (i.e.
x) is shown, in the centre of figure [18] the corresponding output (i.e. its PMD
posterior probability Pr(y|x)) produced by the orientation map is shown, and on
the right of figure [18] the corresponding reconstruction (i.e. $\sum_{y=1}^{M} \Pr(y|\mathbf{x})\, \mathbf{x}'(y)$)
is shown.
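
Reading the reconstruction as the posterior-weighted sum of the reference vectors, the decoding step can be sketched as below. The function name `reconstruct` is hypothetical, and the sketch assumes that each reference vector has been embedded in the full input space (zero outside the neuron's receptive field), which is an assumption made to keep the example short.

```python
import numpy as np

def reconstruct(posterior, reference_vectors):
    """Posterior-weighted sum of reference vectors: sum_y Pr(y|x) x'(y).

    posterior: shape (M,), non-negative and summing to 1.
    reference_vectors: shape (M, d), one reference vector x'(y) per row
    (assumed to be zero-padded outside each receptive field).
    """
    posterior = np.asarray(posterior, dtype=float)
    reference_vectors = np.asarray(reference_vectors, dtype=float)
    return posterior @ reference_vectors
```

When the posterior is concentrated in a few isolated activity bubbles, only the reference vectors of those neurons contribute noticeably to the sum.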

The output consists of a number of isolated "activity bubbles" of posterior 
probability, and the reconstruction is a low resolution version of the original 
input. The form of output is familiar as a type of "sparse coding" of the input,
where only a small fraction of the neurons participate in encoding a given input
(this type of transformation of the input is central to the work that was reported
in [13]). This type of encoding is very convenient because it has effectively
transformed the input into a small number of constituents each of which corresponds
to an activity bubble, rather than transforming the input into a representation 
where the output activity is spread over all of the neurons, which is thus not 
easily interpretable as arising from a small number of constituents. 

The reconstruction has a lower resolution than the input because there are 
insufficient neurons to faithfully record all the information that is required to 
reconstruct the input exactly (e.g. probability leakage causes neighbouring neu- 
rons to have a correlated response, thus reducing the effective number of neurons 
that are available). The featureless region around the edge of the reconstruction
is an artefact, which occurs because fewer neurons (per unit area) contribute to 
the reconstruction near the edge of the input array. 

4.3.3 Orientation Map (Two Retinae) 

The above orientation map results may be generalised to the case of two retinae. 
The parameter values used were the same, apart from the standard deviation 
of the leakage Gaussian which was reduced to 0.5 x 0.5 in order to allow more 
detailed structure to develop in the adaptive parameter values of the output 
neurons. This is necessary because the presence of two retinae causes dominance 
stripes to develop, which allows only half of the neurons to be allocated to each 
retina, so a complete repertoire of computing machinery must be forced into 
half the number of neurons that were used in the case of one retina. 

The results are shown in figure [19] where the receptive fields for the left 
and right retinae have been used to create a colour separation in which one 
retina is coded as blue and the other as yellow. Within each retina there is 
a long-scale periodic fluctuation in overall brightness which corresponds to the 
dominance stripes. Within each dominance stripe there is the characteristic 
swirl-like pattern of the orientation map. Note that the unoriented regions 
typically occur at the centre of dominance stripes, as observed in the visual 
cortex; this can be understood intuitively by referring to figure [S] 

A larger simulation would be required in order to accurately estimate the 
detailed orientation map as a vector flow field. Such simulations could be used 
to verify whether the iso-orientation contours typically lie perpendicular to the 
dominance stripe boundaries, as observed in the visual cortex. The dominance 
stripe structure that appears in this simulation is not as distinct as the stripes 
in figure [16]. This is not a fundamental problem, but rather it is a result of the
limited size of computer simulation that could be run in a reasonable length
of time. It should also be noted that the dominance stripes that are observed
in the visual cortex are sometimes more blob-like than stripe-like [2], so it
is pleasing that different choices of parameter value should yield a variety of
degrees of stripiness in our simulations.

Figure 19: Orientation map and dominance stripes after training on natural



5 Conclusions 

This paper has shown how folded Markov chains (FMCs) [6] can be com- 
bined with partitioned mixture distributions (PMDs) [7] to yield a class of self- 
organising neural networks that has many of the properties that are observed 
in the mammalian visual cortex O [14] , which are thus called visual cortex net- 
works (VICON). These neural networks differ from previous models of the visual 
cortex, insofar as they model the neuron behaviour in terms of their individual 
firing events, and operate in the real space of input images rather than a hand- 
crafted abstract space, and the use of Bayesian methods makes the nature of the 
network's computations clearer than in the case where the network behaviour 
is simply postulated. When the neural network structure (e.g. receptive field 
size) parameters are appropriately chosen, dominance stripes and orientation 
maps emerge naturally when the network is trained on a natural image (e.g. a 
Brodatz texture image). 

These results show how this type of network is capable of self-organising its 
internal parameters in familiar ways when trained on data from multiple sources 
(actually, only two sources in the case of the visual cortex-like network). The
same network objective function could be used when an arbitrary number of 
data sources is presented, and it is anticipated that it would lead to analogous 
results. 

An extension of the network objective function to the case where sets of 
multiple neural firing events are considered has been published [HI HH] , and an 
extension to the case of a multilayer network has been published [T^. When 
combined, these extensions could be applied to the problem of the processing of
data from multiple sensors (i.e. data fusion).



A Bayesian PMD 

In this section a fully Bayesian interpretation of a partitioned mixture distribu- 
tion (PMD) will be presented. 

Consider the general problem of computing a posterior probability $\Pr(y|\mathbf{x})$
over classes y given an input vector $\mathbf{x}$. If there is more than one model k then
$\Pr(y|\mathbf{x})$ is given by a marginal PDF

$$\Pr(y|\mathbf{x}) = \sum_{k} \Pr(y, k|\mathbf{x}) \tag{17}$$

where $\Pr(y, k|\mathbf{x})$ is the joint PDF of class y and model k given an input vector
$\mathbf{x}$. Bayes' theorem may be used to rewrite this as follows

$$\begin{aligned}
\Pr(y, k|\mathbf{x}) &= \frac{\Pr(y, k, \mathbf{x})}{\Pr(\mathbf{x})} \\
&= \frac{\Pr(y|k, \mathbf{x})\, \Pr(k, \mathbf{x})}{\Pr(\mathbf{x})} \\
&= \Pr(y|k, \mathbf{x})\, \Pr(k)
\end{aligned} \tag{18}$$

where $\Pr(k, \mathbf{x}) = \Pr(k) \Pr(\mathbf{x})$ (i.e. independence of model k and data vector
$\mathbf{x}$) has been assumed in the last step. Thus the posterior probability $\Pr(y|\mathbf{x})$
may be written as

$$\Pr(y|\mathbf{x}) = \sum_{k} \Pr(y|k, \mathbf{x})\, \Pr(k) \tag{19}$$

Assume that there are M models, and that the prior probabilities $\Pr(k)$ of
the various models are equal, so that $\Pr(k) = \frac{1}{M}$, in which case the posterior
probability reduces to

$$\Pr(y|\mathbf{x}) = \frac{1}{M} \sum_{k=1}^{M} \Pr(y|k, \mathbf{x}) \tag{20}$$


which is an average of M contributing posterior probabilities (one from each 
of the contributing models). The PMD posterior probability in equation [S] is a 
special case of this result. 
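
To make the "average of M posteriors" reading concrete, the sketch below treats each neuron k as defining one model, whose posterior is a mixture posterior normalised over the neighbourhood centred on k, and then averages the M models with equal priors. This is one possible illustration rather than the author's implementation: the function name `pmd_posterior`, the circular boundary, and the assumption that the unnormalised responses `q` are already given are all choices made here to keep the example short.

```python
import numpy as np

def pmd_posterior(q, neighbourhood_size):
    """Equal-prior average of M local mixture posteriors on a ring.

    q: shape (M,), strictly positive unnormalised responses, one per neuron
       (assumed given; how they depend on the input is not modelled here).
    neighbourhood_size: odd number of neurons in each model's neighbourhood.
    """
    q = np.asarray(q, dtype=float)
    M = q.size
    if neighbourhood_size % 2 != 1 or not 1 <= neighbourhood_size <= M:
        raise ValueError("neighbourhood_size must be odd and at most M")
    half = neighbourhood_size // 2
    offsets = np.arange(-half, half + 1)
    posterior = np.zeros(M)
    for k in range(M):
        members = (k + offsets) % M  # neurons that belong to model k
        # Model k's own posterior: normalised over its neighbourhood only.
        posterior[members] += q[members] / q[members].sum()
    return posterior / M  # equal priors Pr(k) = 1/M
```

Each model's posterior sums to 1 over its own neighbourhood, so the average over the M models also sums to 1, even though no neuron is ever normalised against neurons outside its local neighbourhoods.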

More generally, the prior probabilities Pr(k) are k-dependent, and might be
chosen in some optimal fashion to best handle the training set. The simplest
way of determining an optimal Pr(k) is to minimise D with respect to Pr(k);
this merely extends the space in which D is optimised to include more of the
parameters inside Pr(y|x).

B Optimal Solutions



In this section the objective function D will be minimised in the case where
the input space consists of one or more subspaces, within each of which all of 
the input vector components have the same value. In the language of imaging 
sensors, these special cases correspond to each sensor viewing a featureless scene 
(i.e. all pixels having the same brightness value), which is effectively the lowest 
order term in a Taylor expansion of the spatial variation of pixel brightness 
values. This might not appear to be an interesting scenario to consider, but it 
leads to a highly non-trivial optimal network behaviour when D is minimised. 
More complicated input statistics lead to even more complicated optimal network
behaviour, so only the simplest case described above will be considered at
first. 

B.1 One Input Subspace

This may be used to optimise the network for a single sensor viewing a featureless
scene. For a d-dimensional input space $\Pr(\mathbf{x})$ is thus given by

$$\Pr(\mathbf{x}) = \Pr(x_1) \prod_{i=2}^{d} \delta(x_i - x_1) \tag{21}$$

whence the objective function D in equation [3] reduces to

$$D = 2d \int dx_1 \Pr(x_1) \sum_{y=1}^{M} \Pr(y|x_1) \left( x_1 - x'_1(y) \right)^2 \tag{22}$$

This is d times the objective function for a 1-dimensional soft scalar quantiser
which encodes inputs in $x_1$-space whose PDF is $\Pr(x_1)$.
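
The origin of the factor of d can be spelled out in one step. The sketch below rests on two assumptions that are not restated in this appendix: that the objective function in equation [3] has the mean square reconstruction error form shown on the third line, and that each reference vector shares the symmetry of the input, so that all d of its components take a common value $x'_1(y)$.

```latex
\begin{aligned}
\|\mathbf{x} - \mathbf{x}'(y)\|^2
  &= \sum_{i=1}^{d} \left( x_i - x'_i(y) \right)^2 \\
  &= d \left( x_1 - x'_1(y) \right)^2
  \quad \text{when } x_i = x_1 \text{ and } x'_i(y) = x'_1(y) \text{ for all } i, \\
D &= 2 \int d\mathbf{x}\, \Pr(\mathbf{x}) \sum_{y=1}^{M} \Pr(y|\mathbf{x})\,
     \|\mathbf{x} - \mathbf{x}'(y)\|^2 \\
  &= 2d \int dx_1\, \Pr(x_1) \sum_{y=1}^{M} \Pr(y|x_1)
     \left( x_1 - x'_1(y) \right)^2 .
\end{aligned}
```

The delta functions in the input PDF remove all but the $x_1$ integration, so the d-dimensional problem collapses to d copies of the same scalar quantisation problem.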

B.2 Two Input Subspaces 

This may be used to optimise the network for a pair of sensors each of which
views a featureless scene, and which are possibly correlated with each other.
The one input subspace case above can readily be generalised to more input
subspaces. Let the d-dimensional input space be split into two $\frac{d}{2}$-dimensional
subspaces, where $\Pr(\mathbf{x})$ is given by

$$\Pr(\mathbf{x}) = \Pr(x_1, x_2) \prod_{j=2}^{d/2} \delta(x_{2j-1} - x_1)\, \delta(x_{2j} - x_2) \tag{23}$$

where one of the subspaces consists of the odd-numbered components, and the
other the even-numbered components of the input vector (this particular ordering
of the components is not important). Whence the objective function D in
equation [3] reduces to

$$D = d \int dx_1\, dx_2 \Pr(x_1, x_2) \sum_{y=1}^{M} \Pr(y|x_1, x_2) \left( (x_1 - x'_1(y))^2 + (x_2 - x'_2(y))^2 \right) \tag{24}$$

This is $\frac{d}{2}$ times the objective function for a 2-dimensional soft vector quantiser
which encodes inputs in $(x_1, x_2)$-space whose PDF is $\Pr(x_1, x_2)$. This result
generalises in the obvious way to a larger number of input subspaces.

B.3 PMD Posterior Probability 

In the above special cases each neuron potentially responds to all of the compo- 
nents of the input vector. If this were to be built in hardware, then each neuron 
would have a number of inputs equal to the dimensionality of the input space, 
which becomes unwieldy if the input space has a high dimensionality (e.g. an
image). For high-dimensional inputs it is sensible to limit the number of inputs
to each neuron, which can readily be implemented by imposing a finite-sized 
receptive field on the input of each neuron, such that it can respond only to a 
limited subset of all of the input vector components. This constraint will pre- 
vent the ideal vector quantiser solutions from being obtained, so the purpose of 
this section is to derive the constrained optimal solution. Note that this type of 
input is a special case of the type of solution that would be obtained by adding 
a "wire-length" penalty term to the objective function in order to penalise the 
connection of a neuron to too many input components. 

Even if receptive fields are used to restrict the length of the input connections,
the posterior probability $\Pr(y|\mathbf{x})$ effectively needs long-range lateral
connections between the output neurons in order to implement the normalisation
condition $\sum_{y=1}^{M} \Pr(y|\mathbf{x}) = 1$. The simplest example of this is the standard
vector quantiser, whose winner-take-all property requires that all neurons are
laterally connected to all other neurons even if each of them has only a finite-sized
receptive field. A partitioned mixture distribution (PMD) posterior probability,
in which the posterior probability $\Pr(y|\mathbf{x})$ is only locally connected, can
be used to ensure that all the connections in the network are local (see section
[2.3]).

B.3.1 Receptive Fields 

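Write the input vector as $\mathbf{x} = (\tilde{\mathbf{x}}(y), \bar{\mathbf{x}}(y))$, where $\tilde{\mathbf{x}}(y)$ is the part of $\mathbf{x}$ that
lies within the receptive field of neuron y and $\bar{\mathbf{x}}(y)$ is the part that lies outside it.
For simplicity, assume that the receptive field used for $\mathbf{x}'(y)$ is chosen to be
the same as that for $\tilde{\mathbf{x}}(y)$, and that all receptive fields see the same number w
of input components. Because the input vector is split into two subspaces as
$\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$, its decomposition as $(\tilde{\mathbf{x}}(y), \bar{\mathbf{x}}(y))$ may similarly be split into
two subspaces as $\tilde{\mathbf{x}}(y) = (\tilde{\mathbf{x}}_1(y), \tilde{\mathbf{x}}_2(y))$ and $\bar{\mathbf{x}}(y) = (\bar{\mathbf{x}}_1(y), \bar{\mathbf{x}}_2(y))$. Use the
orthogonality of $\tilde{\mathbf{x}}(y)$ and $\bar{\mathbf{x}}(y)$ to write (for i = 1, 2)

$$\|\tilde{\mathbf{x}}_i(y) + \bar{\mathbf{x}}_i(y) - \mathbf{x}'_i(y)\|^2 = \|\bar{\mathbf{x}}_i(y)\|^2 + \|\tilde{\mathbf{x}}_i(y) - \mathbf{x}'_i(y)\|^2$$

and simplify D in equation [3] thus

$$D = 2 \int d\mathbf{x}_1\, d\mathbf{x}_2 \Pr(\mathbf{x}_1, \mathbf{x}_2) \sum_{y=1}^{M} \Pr(y|\mathbf{x}_1, \mathbf{x}_2) \left( \|\bar{\mathbf{x}}_1(y)\|^2 + \|\bar{\mathbf{x}}_2(y)\|^2 + \|\tilde{\mathbf{x}}_1(y) - \mathbf{x}'_1(y)\|^2 + \|\tilde{\mathbf{x}}_2(y) - \mathbf{x}'_2(y)\|^2 \right) \tag{25}$$
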
There are two terms to consider. 

1. $\|\bar{\mathbf{x}}_1(y)\|^2 + \|\bar{\mathbf{x}}_2(y)\|^2$: This is the contribution from outside the $y^{th}$
receptive field, which is the squared L2 norm of those components of the input vector
that lie outside the $y^{th}$ receptive field.

2. $\|\tilde{\mathbf{x}}_1(y) - \mathbf{x}'_1(y)\|^2 + \|\tilde{\mathbf{x}}_2(y) - \mathbf{x}'_2(y)\|^2$: This is the contribution from
inside the $y^{th}$ receptive field, which is the squared L2 norm of those components of
the error vector (i.e. input minus reconstruction) that lie inside the $y^{th}$
receptive field.

B.3.2 Simplify the $\|\bar{\mathbf{x}}_1(y)\|^2 + \|\bar{\mathbf{x}}_2(y)\|^2$ Term

$\|\bar{\mathbf{x}}_1(y)\|^2 + \|\bar{\mathbf{x}}_2(y)\|^2$ is the squared L2 norm of those components of the input vector
that lie outside the $y^{th}$ receptive field, which is known once the input vector
is specified. Furthermore, because of the assumed input PDF (i.e. all input
components in each subspace have the same value), together with the assumed
receptive field prescription (i.e. all receptive fields are the same size w), this
norm is independent of y given that $\mathbf{x}$ is known, so this term has the following
contribution to D

$$D = (d - w) \left( \int dx_1 \Pr(x_1)\, x_1^2 + \int dx_2 \Pr(x_2)\, x_2^2 \right) \tag{26}$$

where d - w is the number of input components that lie outside each receptive
field.

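B.3.3 Simplify the $\|\tilde{\mathbf{x}}_1(y) - \mathbf{x}'_1(y)\|^2 + \|\tilde{\mathbf{x}}_2(y) - \mathbf{x}'_2(y)\|^2$ Term

Assume that $\Pr(y|\mathbf{x}_1, \mathbf{x}_2)$ has the PMD form of a sum over mixture distribution
posterior probabilities (as described in section [2.3]), so that

$$\begin{aligned}
\Pr(y|\mathbf{x}_1, \mathbf{x}_2) &= \frac{1}{M} \sum_{y' \in \mathcal{N}^{-1}(y)} \Pr(y|\mathbf{x}_1, \mathbf{x}_2; y') \\
&= \frac{1}{M}\, Q(\mathbf{x}_1, \mathbf{x}_2|y) \sum_{y' \in \mathcal{N}^{-1}(y)} \frac{1}{\sum_{y'' \in \mathcal{N}(y')} Q(\mathbf{x}_1, \mathbf{x}_2|y'')}
\end{aligned} \tag{27}$$

The overall receptive field that affects the value of $\Pr(y|\mathbf{x}_1, \mathbf{x}_2)$ (for a given y)
may be read off this expression. Thus $\tilde{\mathbf{x}}(y'')$ comprises those components of $\mathbf{x}$
that lie within the receptive field of neuron $y''$, and the
$\sum_{y' \in \mathcal{N}^{-1}(y)} \frac{1}{\sum_{y'' \in \mathcal{N}(y')} (\cdots)}$
operation compounds these $\tilde{\mathbf{x}}(y'')$ so that the overall set of components of $\mathbf{x}$
that are needed for the purposes of calculating $\Pr(y|\mathbf{x}_1, \mathbf{x}_2)$ is given by (using
a somewhat cavalier notation)

$$\hat{\mathbf{x}}(y) = \bigcup_{y' \in \mathcal{N}^{-1}(y)} \; \bigcup_{y'' \in \mathcal{N}(y')} \tilde{\mathbf{x}}(y'') \tag{28}$$

The individual $\Pr(y|\mathbf{x}_1, \mathbf{x}_2; y')$ that contribute to $\Pr(y|\mathbf{x}_1, \mathbf{x}_2)$ each depend on
a smaller set of components of $\mathbf{x}$ than the full $\Pr(y|\mathbf{x}_1, \mathbf{x}_2)$, because there is one
less summation over a y variable. However, it is convenient, and imposes no
constraint, to use the full set of components thus

$$\Pr(y|\mathbf{x}_1, \mathbf{x}_2; y') = \Pr(y|\hat{\mathbf{x}}_1(y), \hat{\mathbf{x}}_2(y); y') \tag{29}$$

The y and y' summations can be interchanged using
$\sum_{y=1}^{M} \sum_{y' \in \mathcal{N}^{-1}(y)} (\cdots) = \sum_{y'=1}^{M} \sum_{y \in \mathcal{N}(y')} (\cdots)$,
whence the contribution to D is

$$D = \frac{2}{M} \int d\mathbf{x}_1\, d\mathbf{x}_2 \Pr(\mathbf{x}_1, \mathbf{x}_2) \sum_{y'=1}^{M} \sum_{y \in \mathcal{N}(y')} \Pr(y|\hat{\mathbf{x}}_1(y), \hat{\mathbf{x}}_2(y); y') \left( \|\tilde{\mathbf{x}}_1(y) - \mathbf{x}'_1(y)\|^2 + \|\tilde{\mathbf{x}}_2(y) - \mathbf{x}'_2(y)\|^2 \right) \tag{30}$$

Because the components of $\tilde{\mathbf{x}}_i(y)$ are a subset of the components of $\hat{\mathbf{x}}_i(y)$ (for
i = 1, 2), $\Pr(\mathbf{x}_1, \mathbf{x}_2)$ can be marginalised to yield

$$D = \frac{2}{M} \sum_{y'=1}^{M} \sum_{y \in \mathcal{N}(y')} \int d\hat{\mathbf{x}}_1(y)\, d\hat{\mathbf{x}}_2(y) \Pr(\hat{\mathbf{x}}_1(y), \hat{\mathbf{x}}_2(y)) \Pr(y|\hat{\mathbf{x}}_1(y), \hat{\mathbf{x}}_2(y); y') \left( \|\tilde{\mathbf{x}}_1(y) - \mathbf{x}'_1(y)\|^2 + \|\tilde{\mathbf{x}}_2(y) - \mathbf{x}'_2(y)\|^2 \right) \tag{31}$$

Because $\Pr(\mathbf{x}_1, \mathbf{x}_2)$ specifies that all of the components in each subspace are the
same, this contribution to D may be simplified to

$$D = \frac{w}{M} \sum_{y'=1}^{M} \sum_{y \in \mathcal{N}(y')} \int dx_1\, dx_2 \Pr(x_1, x_2) \Pr(y|x_1, x_2; y') \left( (x_1 - x'_1(y))^2 + (x_2 - x'_2(y))^2 \right) \tag{32}$$

B.3.4 Periodic Optimal Solutions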

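Combining the results from outside (equation [26]) and inside (equation [32]) the
receptive fields yields finally

$$\begin{aligned}
D = {} & (d - w) \left( \int dx_1 \Pr(x_1)\, x_1^2 + \int dx_2 \Pr(x_2)\, x_2^2 \right) \\
& + \frac{w}{M} \sum_{y'=1}^{M} \sum_{y \in \mathcal{N}(y')} \int dx_1\, dx_2 \Pr(x_1, x_2) \Pr(y|x_1, x_2; y') \left( (x_1 - x'_1(y))^2 + (x_2 - x'_2(y))^2 \right)
\end{aligned} \tag{33}$$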

The first of these terms is constant, so it may be ignored insofar as network
optimisation is concerned. The second term is much more interesting. It is the
sum of the objective functions of a large number of 2-dimensional soft vector
quantisers. However, these objective functions cannot be optimised independently
of each other, because the posterior probabilities $\Pr(y|x_1,x_2;y')$ force
the neurons to share parameters with each other.

Drop the constant term, and interchange the order of summation to obtain 

$$D = w \sum_{y=1}^{M} \int dx_1\, dx_2\, \Pr(x_1,x_2)\, \Pr(y|x_1,x_2) \left( (x_1 - x_1'(y))^2 + (x_2 - x_2'(y))^2 \right) \tag{34}$$



where $\Pr(y|x_1,x_2)$ is the PMD posterior probability given by

$$\Pr(y|x_1,x_2) = \frac{1}{M} \sum_{y' \in \mathcal{N}^{-1}(y)} \Pr(y|x_1,x_2;y') \tag{35}$$
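
A minimal sketch of how a posterior of this form might be evaluated is given below. It rests on several assumptions that are not taken from the text: the sum over $y'$ carries a $1/M$ normalisation, each $\Pr(y|x_1,x_2;y')$ is a likelihood $Q$ normalised over the neighbourhood $\mathcal{N}(y')$, the neurons lie on a ring with wrap-around, and $Q$ is a Gaussian bump whose centres are made up purely for illustration.

```python
import math


def neighbourhood(y_prime, M, radius):
    """N(y'): neurons within `radius` of y' on a ring of M neurons (assumed layout)."""
    return [(y_prime + k) % M for k in range(-radius, radius + 1)]


def q(x1, x2, y, M):
    """Hypothetical unnormalised likelihood Q(x1, x2 | y): a Gaussian bump."""
    c1 = math.cos(2 * math.pi * y / M)
    c2 = math.sin(2 * math.pi * y / M)
    return math.exp(-((x1 - c1) ** 2 + (x2 - c2) ** 2))


def pmd_posterior(x1, x2, M, radius):
    """Pr(y | x1, x2) as (1/M) times the sum over y' of Pr(y | x1, x2; y')."""
    post = [0.0] * M
    for yp in range(M):
        members = neighbourhood(yp, M, radius)
        norm = sum(q(x1, x2, ypp, M) for ypp in members)
        for y in members:
            # Pr(y | x1, x2; y') is Q normalised over the neighbourhood of y'.
            post[y] += q(x1, x2, y, M) / norm / M
    return post


posterior = pmd_posterior(0.3, -0.5, M=12, radius=2)
print(sum(posterior))  # close to 1, up to rounding
```

Each neighbourhood contributes total probability $1/M$, so the $M$ neighbourhoods together give a posterior that sums to one over all neurons.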

Now suppose that $\Pr(y|x_1,x_2)$ and $x_i'(y)$ have the periodicity property

$$\begin{aligned} \Pr(y + m|x_1,x_2) &= \Pr(y|x_1,x_2) \\ x_i'(y + m) &= x_i'(y), \quad i = 1, 2 \end{aligned} \tag{36}$$

where the fact that $y$ is restricted to $1 \le y \le M$ has been ignored for simplicity,
then $D$ can be simplified thus (again, ignoring the fact that $y$ is restricted to
$1 \le y \le M$)



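$$\begin{aligned} D &= w \sum_{y_0=0}^{\frac{M}{m}-1} \; \sum_{y=m y_0+1}^{m(y_0+1)} \int dx_1\, dx_2\, \Pr(x_1,x_2)\, \Pr(y|x_1,x_2) \left( (x_1 - x_1'(y))^2 + (x_2 - x_2'(y))^2 \right) \\ &= w \sum_{y=1}^{m} \int dx_1\, dx_2\, \Pr(x_1,x_2)\, \frac{M}{m} \Pr(y|x_1,x_2) \left( (x_1 - x_1'(y))^2 + (x_2 - x_2'(y))^2 \right) \end{aligned} \tag{37}$$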
where $\sum_{y=1}^{m} \frac{M}{m} \Pr(y|x_1,x_2) = 1$ follows from $\sum_{y=1}^{M} \Pr(y|x_1,x_2) = 1$ and the

periodicity property, so $\frac{M}{m} \Pr(y|x_1,x_2)$ serves as a posterior probability for
$1 \le y \le m$.
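
The reduction to a single period can be checked at one input point with a short Python sketch. Everything specific here is an assumption made for illustration: a ring of $M$ neurons with wrap-around, $M$ a multiple of $m$, a $1/M$-normalised posterior built from a Gaussian likelihood, and hypothetical reference vectors placed on a circle. Only the integrand is compared, so the prefactor $w$ and the integral over the inputs are left out.

```python
import math

M, m, radius = 12, 4, 2  # assumed sizes; M must be a multiple of m


def neighbourhood(y_prime):
    """N(y'): neurons within `radius` of y' on a ring of M neurons (assumed layout)."""
    return [(y_prime + k) % M for k in range(-radius, radius + 1)]


def ref(y):
    """Hypothetical reference vector (x1'(y), x2'(y)), periodic in y with period m."""
    angle = 2 * math.pi * (y % m) / m
    return math.cos(angle), math.sin(angle)


def sq_err(x1, x2, y):
    """Squared reconstruction error for neuron y."""
    r1, r2 = ref(y)
    return (x1 - r1) ** 2 + (x2 - r2) ** 2


def posterior(x1, x2):
    """Posterior over all M neurons, with a Gaussian likelihood centred on ref(y)."""
    post = [0.0] * M
    for yp in range(M):
        members = neighbourhood(yp)
        norm = sum(math.exp(-sq_err(x1, x2, ypp)) for ypp in members)
        for y in members:
            post[y] += math.exp(-sq_err(x1, x2, y)) / norm / M
    return post


x1, x2 = 0.3, -0.5
post = posterior(x1, x2)

# Sum over all M neurons versus the rescaled sum over one period of m neurons.
full = sum(post[y] * sq_err(x1, x2, y) for y in range(M))
reduced = sum((M / m) * post[y] * sq_err(x1, x2, y) for y in range(m))
one_period_mass = sum((M / m) * post[y] for y in range(m))

print(full, reduced, one_period_mass)
```

In this toy setting the two sums agree and the rescaled posterior sums to one over a single period, in line with the periodicity argument above.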

This demonstrates that if the optimal solution is periodic, with period $m$,
then the objective function is proportional to the objective function for a
2-dimensional soft vector quantiser with $m$ neurons. Note that thus far nothing
has been said about the actual value of $m$; its optimal value depends on the
interplay between the receptive field size(s), the output layer neighbourhood
size(s), and the leakage neighbourhood size(s). Because this type of periodic
solution is essentially a set of overlapping $m$-neuron soft vector quantisers, each
set of $m$ neurons will typically exhibit the properties of such quantisers. In
particular, this means that each set of $m$ neurons will have the means to encode
and (approximately) reconstruct those components of the input vector that it
sees via its receptive fields.

This type of solution is the archetype for orientation maps, where the neurons
arrange their properties so that each local patch (corresponding to the $m$
neurons in the periodic solution derived above) has the means to encode whatever
orientation of object it sees via its receptive fields. The full derivation of
an orientation map would require a more sophisticated analysis than the simple
1-dimensional case derived above.